1. INTRODUCTION

Speaker recognition and verification have gained visibility and significance in society as speech
technology, audio content, and e-commerce continue to expand. There is a growing need to search audio
content by speaker identity, and research on this problem is attracting increasing interest from young
scientists. It is not difficult to imagine a future in which intelligent, sympathetic, and fully functional
personal assistants recognize us not only by what we say but also by our voice and other trackable or
recognizable traits.

It is common experience that we often cannot recognize the voice of a person we have heard only once,
and that it can be difficult to identify even a familiar voice over the telephone. In view of this, a naive
person may wonder what precisely makes speaker recognition such a difficult task and why it is the subject
of such thorough research. Broadly, speaker recognition can be carried out in three ways. Any individual
can easily recognize the familiar voice of a person without any conscious training; this can be called
"naive speaker recognition". In forensic identification, a voice sample of a person, for example from a
database of telephone calls, is often compared with those of potential suspects. In these cases, trained
listeners provide the decision; we categorize this method as forensic speaker recognition. Finally, in this
computer-based world we have automatic speaker recognition (ASR) systems, in which a machine performs the
speech analysis and makes the decision automatically. The forensic and ASR research communities have
developed several methods independently for at least seven decades. In contrast, naive recognition is a
natural ability of human beings and is generally very effective and accurate. Recent brain-imaging research
has revealed many details of how a human being performs cognitive speaker recognition, which can motivate
new directions for both automatic and forensic systems [1, 2].

In this review paper, I present an up-to-date literature review of ASR systems covering the last seven
decades, and give the reader a perspective on how human listeners, both forensic experts and naive
listeners, recognize speakers. Its main purpose is to discuss these three categories of speaker recognition
and the important similarities and differences between them. I emphasize how automatic speaker recognition
systems have evolved toward more current approaches over time. Many speech processing techniques, such as
the Mel-scale filter bank for feature extraction and concepts in noise masking, are inspired by the human
auditory system. There are also parallels between the methods of forensic voice experts and those used by
automatic systems, although in many cases the research communities are separate. I believe that including
the perspective of speech perception by humans in this review, with highlights of both the strengths and
weaknesses of human speaker recognition compared with machines, will help readers and perhaps inspire new
research in the field of the man-machine interface.

First, to frame the general research domain, it is useful to clarify what is covered by the term
speaker recognition, which comprises two tasks: verification and identification. In speaker verification,
the task is to accept or reject the identity claimed by a speaker. In speaker identification, the task is to
identify an unknown speaker from a set of known speakers. In other words, the objective is to find the
speaker in a speech database who sounds closest to the speech of the unknown speaker. When all speakers
within a given set are known, this is called a closed-set scenario. If, on the other hand, the test speech
may come from outside the predefined group of known speakers, this becomes an open-set scenario, and a
world model or universal background model (UBM) [3] is required. This situation is called open-set speaker
recognition.
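
The difference between the closed-set and open-set scenarios can be illustrated with a minimal Python
sketch. It assumes that each enrolled speaker model has already produced a log-likelihood score for the
test utterance; the function name `identify`, the score dictionary, and the threshold value are
hypothetical and only serve the illustration.

```python
def identify(scores, ubm_score=None, threshold=0.0):
    """Pick the best-matching enrolled speaker for one test utterance.

    scores    : dict mapping speaker name -> log-likelihood under that speaker's model
    ubm_score : log-likelihood under the background model (None means closed set)
    threshold : minimum log-likelihood ratio needed to accept (assumed value)
    """
    best = max(scores, key=scores.get)
    if ubm_score is None:
        # Closed set: the test speaker is assumed to be one of the enrolled speakers.
        return best
    # Open set: accept only if the best model beats the background model clearly enough.
    if scores[best] - ubm_score > threshold:
        return best
    return None  # speaker judged to be outside the enrolled set


scores = {"speaker_a": -10.0, "speaker_b": -12.0}
print(identify(scores))                    # closed-set decision
print(identify(scores, ubm_score=-9.0))    # open-set decision, rejected
```

In the open-set call the background model explains the utterance better than any enrolled model, so no
speaker is returned.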


2. THE MAIN CHALLENGES IN AUTOMATIC SPEAKER RECOGNITION SYSTEMS IN THE
PRESENT SCENARIO

Like other biometrics such as iris, fingerprint, face, and hand [4], the human voice is a biometric
trait. The identity of the speaker is naturally embedded in how a person speaks, not necessarily in what
is being said. This makes speech signals prone to a high degree of variability.

The fact that a person does not say the same word in exactly the same way every time is called
intra-speaker variability [4, 5]. In addition, the various electronic devices used for recording and
transmission usually add to the complexity of the problem. A listener may find it hard to identify a
person's voice over a mobile phone, or when the speaker has a cold, is unwell, or is performing another
task under stress. The sources of speaker variability can be broadly classified into three categories:
(i) technology-based, (ii) speaker-based, and (iii) conversation-based.


2.1. Challenge and Opportunity in Speaker Recognition 

Early efforts in speaker recognition focused mostly on technology-based variability, notably in the
telecommunications sector, where variation in the communication channel and the telephone handset was the
main concern. As smartphones have come to dominate the telecom industry, the telephony landscape has become
significantly more varied. The speakerphone option available on all smartphones lets the user interact at a
distance from the microphone, which introduces a broad range of channel variability. The performance of a
speaker recognition system depends on intersession variability as well as on the inherent changes within
human utterances recorded in different sessions. However, speaker recognition efficiency seems to be
independent of the time at which the voice samples for training and testing were collected [6, 7].

Most uses of forensic speaker recognition in legal scenarios are not very complicated. When adequate
voice samples of the suspect are available, a methodical study can be carried out to extract the
speaker-specific properties, also called speaker-specific feature parameters, from the voice data, and
these can be compared between the samples. In an automatic speaker recognition system, speaker-specific
features are extracted from the speech signal and mathematically modeled to perform a meaningful
comparison.


2.2. Individual Characterization Based on Speaker Specific Features 

Every individual has certain traits in his or her speech that are unique. The speaking characteristics
of one individual may not differ greatly from those of another, but each speaker's voice is unique, mainly
because of the physiology of the vocal tract and learned habits of articulation. Even twins show
differences in their voices, although research indicates that they have the same vocal tract size [8] and
acoustic properties [9], and it is difficult to distinguish them from a perceptual/forensic perspective
[10, 11]. Thus, whether the speaker is identified by humans or by machines, some measurable and predefined
speaker-specific aspects of speech must be considered for a meaningful comparison. In general, we refer to
these characterizing aspects as feature parameters of the human speech signal.

One might expect that each person's speech signal has unique features, but this is not always true.
Consider two different speakers with the same speaking rate but different pitch: speaking rate would not
separate them, whereas pitch would be a suitable feature. The picture is further complicated by the
intra-speaker variability and degradations discussed earlier, which is why many feature parameters are
important. Nolan reported that an ideal speaker-specific feature parameter must have the following
properties [12]: it is easy to extract and process, robust, has a high frequency of occurrence, and is
highly resistant to attempted disguise or mimicry. Speaker-specific feature parameters can be classified as
short-term versus long-term, linguistic versus nonlinguistic, and auditory versus acoustic. Auditory and
acoustic features each have strengths and weaknesses: two speech samples may sound very similar yet differ
greatly in their acoustic parameters [13].


3. FORENSIC SPEAKER RECOGNITION 

Forensic speaker recognition is needed when an offender leaves his or her voice as evidence of a
crime, either as a telephone recording or as speech heard by an earwitness. The use of technology for
forensic speaker recognition was discussed as early as 1926, using speech waveforms [14]. Later, the
spectrographic representation of speech was developed at AT&T Bell Laboratories during World War II. Much
later, in 1970, it came to be known as the voiceprint [15]. As the name suggests, the voiceprint was
presented as analogous to the fingerprint and was met with very high expectations.

Later, the reliability of the voiceprint for speaker recognition, from its operating procedure to its
formal process, was examined and called into question [16, 17]; it was even described as "an idea that
has gone wrong" [17]. Today, most researchers consider it controversial at best. A chronological history
of the voiceprint can be found in [18], and an overview of the discussion on forensic speaker recognition
in [19]; here I present an overview of current trends [4]. Today, forensic speaker recognition is generally
performed by expert phoneticians, who typically have a linguistic and statistical background.


3.1. Different Approaches to Forensic Speaker Identification 

The methods described here are carried out by human experts in whole or in part. Fully automated
approaches are also considered for forensic speaker recognition, but automatic speaker identification is
discussed in later sections. In the auditory approach, phoneticians rely on the human auditory system and
on their experience to produce a detailed transcript of the test samples. Forensic experts listen to the
speech samples and try to detect any unusual, specific, or noteworthy sounds [20]. The expert's experience
is evidently the main factor in deciding what is rare and what is typical. The auditory features discussed
above are used in this approach.

Unless it is combined with other methods, the auditory approach is completely subjective. Although
the likelihood ratio (LR) can be used to express results, forensic experts who follow the auditory approach
generally do not use it. Instead, on the basis of their comparison of auditory features, they present a
statement of evidence in court. In the auditory-spectrographic approach, spectrograms of the same word or
phrase from the known and the questioned voice are analyzed visually. This spectrographic technique evolved
after the debate over the voiceprint. It became clear that forensic experts could not separate intraspeaker
from interspeaker variability by a general visual assessment of spectrograms. They therefore developed
different protocols that require the forensic examiner to analyze predetermined aspects of the
spectrograms.


3.2. Speaker Recognition by Human 

The ability to distinguish people by listening to their voices is a God-gifted characteristic. It is
mentioned in the "Mahabharata", which some historians say was written in 400 BC, that when Abhimanyu was
in his mother's womb, Sri Krishna used to go on walks with Subhadra. To humour her, Krishna used to relate
many of his adventures to the pregnant Subhadra. On one such excursion, Krishna described his experience
with the Chakra-Vyu technique and how its various circles could be penetrated step by step. However, it
seems that Subhadra did not find the topic interesting and soon fell asleep. Someone else, however, was
listening with interest to Sri Krishna's description: the yet-to-be-born Abhimanyu.

To identify a person, we use a number of different aspects of the human voice, including spectral
features, language, prosody, and lexical style. We learn and remember these features even without
conscious effort. What is currently known about how inexperienced (naive) listeners perform speaker
recognition can be organized under the following aspects: (i) voice segment identification,
(ii) recognition and discrimination, (iii) language familiarity, and (iv) abstract representation of
speech.


Int J Elec & Comp Eng, Vol. 8, No. 5, October 2018 : 2804 - 2811 


IJECE ISSN: 2088-8708


3.3. State-of-the-Art Automatic Speaker Recognition System 

An ASR system is a computer system, based on mathematical algorithms, that is designed to recognize
the voice of a speaker and operates independently with minimal human intervention. The system
administrator can adjust the algorithm parameters, but to compare speech segments, all a user has to do is
provide the speech signal to the ASR system. In this paper, I concentrate on text-independent ASR systems
and on speaker verification. As mentioned earlier, humans are good at differentiating voiced from
non-voiced signals, which is an important part of auditory forensic speaker recognition. In ASR it is
likewise desirable that speaker-specific features be extracted only from the voiced portions of the speech
signal, which are located by voice activity detection (VAD) [21, 22]. Detecting speech segments before
feature extraction is especially important under conditions of excessive noise or a degraded speech
signal. A recently used VAD algorithm is explained in [21], although a more accurate unsupervised speech
activity detection (SAD) solution has proved successful in various ASR applications under diverse audio
conditions [23].

Short-term speaker-specific features in ASR applications are parameters extracted from short segments
of the speech signal, typically 20-25 ms long. The most popular short-term acoustic features reported for
ASR are the Mel-frequency cepstral coefficients (MFCCs) [24] and linear predictive coding (LPC) based
features [25]. The steps involved in obtaining MFCC features from a speech signal are: (i) divide the
speech signal into short overlapping segments (25 ms); (ii) multiply these segments by a window function
(Hamming or Hanning) and obtain the Fourier power spectrum; (iii) apply the logarithm of the spectrum;
(iv) apply a nonlinear Mel-space filter bank to obtain the spectral energy in each channel (24-channel
filter bank); (v) apply the discrete cosine transform (DCT) to obtain the MFCCs. As indicated previously,
one of the desirable qualities of an acoustic feature is robustness to degradation. Feature normalization
is one way of approaching this desirable characteristic of an ideal feature parameter [26].
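
The MFCC steps can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not
the implementation used in the cited works: the 10 ms frame shift, the 512-point FFT, the 13 retained
coefficients, and all function names are assumptions. The sketch also follows the more common convention
of taking the logarithm of the Mel filter-bank energies, i.e. steps (iii) and (iv) of the list are applied
in swapped order.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centres equally spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):        # rising edge
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):       # falling edge
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def mfcc(signal, sample_rate=16000, frame_ms=25, hop_ms=10,
         n_filters=24, n_ceps=13, n_fft=512):
    """Return an (n_frames, n_ceps) array; the signal must be at least one frame long."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # (i) short overlapping frames, (ii) Hamming window and power spectrum
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel filter-bank energies, then logarithm (small floor avoids log(0))
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)
    # (v) DCT-II of the log energies, keeping the first n_ceps coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return log_energies @ basis.T

features = mfcc(np.random.default_rng(0).standard_normal(16000))
print(features.shape)
```

One second of audio at 16 kHz yields one 13-dimensional feature vector per 10 ms frame shift.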


4. MODELING OF STATE-OF-THE-ART ASR SYSTEM 

Once the audio segments have been converted into feature parameters, the modeling process begins. In
ASR, modeling is the process of characterizing each speaker on the basis of his or her features. The model
should also provide a means of comparison with utterances from unknown speakers. A modeling method is
called robust when its characterization of the speaker-specific features is not significantly affected by
unwanted distortions. Ideally, if the features could be designed so that interspeaker discrimination is
maximal and no intraspeaker variation exists, simple modeling methods would be sufficient. In short, the
non-ideal properties of the speaker-specific feature extraction stage require various compensation
techniques during the ASR modeling phase, so that the effect of nuisance variation present in the speech
signal is reduced when the speaker recognition process is tested. Most ASR modeling techniques make various
mathematical assumptions about the speaker-specific features. If the speech data do not satisfy these
assumed properties, we are essentially introducing imperfections during the ASR modeling phase as well.

Normalization of the speaker-specific features can reduce these problems to some extent, but not
completely. As a result, the mathematical models are forced to fit the features, and speaker recognition
scores are obtained from these models and the test speech data. This process introduces artifacts in the
detection scores, and a family of score normalization techniques has been proposed to counter this
final-stage mismatch [27]. In essence, degradation of the acoustic signal affects the speaker-specific
features, the models, and the scores. It is therefore important to improve the robustness of ASR systems
in all three domains. It has been noted recently that, as speaker modeling techniques have improved, score
normalization techniques are no longer as effective [28, 29].


4.1. ASR System Based on Gaussian Mixture Model (GMM) 

When there is no prior knowledge of the speech content, as in text-independent speaker recognition
tasks, GMMs have been found to be very effective for acoustic modeling of short-term features. This is to
be expected, since the average behavior of the short-term spectral features depends more on the speaker
than on temporal characteristics. Therefore, even when the ASR test data come from a different acoustic
condition, the GMM, being a probabilistic model, can relate to the data better than the more restrictive
vector quantization (VQ) model. A GMM is a mixture of Gaussian probability density functions (PDFs),
parameterized by a number of mean vectors, covariance matrices, and weights of the individual mixture
components. The model is a weighted sum of the individual PDFs. The Gaussian mixture density is the
weighted sum of M component densities and is given by


$$p(x \mid \lambda) = \sum_{i=1}^{M} p_i \, b_i(x) \qquad (1)$$

where $x$ is a $D$-dimensional random vector, $b_i(x)$, $i = 1, \ldots, M$, are the component densities,
and $p_i$ are the mixture weights, which satisfy $\sum_{i=1}^{M} p_i = 1$. Each component density is a
$D$-variate Gaussian function of the form


$$b_i(x) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\} \qquad (2)$$


Here $\mu_i$ is the mean vector and $\Sigma_i$ the covariance matrix of component $i$. The complete
Gaussian mixture density is parameterized by the mean vectors, covariance matrices, and mixture weights of
all component densities. These parameters are collectively represented by the notation


$$\lambda = \{ p_i, \mu_i, \Sigma_i \}, \quad i = 1, \ldots, M \qquad (3)$$


In an ASR system, each speaker is represented by a GMM and is referred to by his or her model $\lambda$.
The form of the GMM may vary depending on the choice of covariance matrix. The GMM can be evaluated by
computing the likelihood of a feature vector using eqn. (1).
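
As an illustration of how eqn. (1) is evaluated in practice, the following Python sketch computes the
log-likelihood of one feature vector under a GMM. It assumes diagonal covariance matrices, a common
simplification that is one of the possible covariance choices, and works in the log domain for numerical
stability; the function and argument names are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | lambda) for one D-dimensional vector x.

    weights   : (M,)   mixture weights p_i, summing to one
    means     : (M, D) mean vectors mu_i
    variances : (M, D) diagonals of the covariance matrices Sigma_i (assumed diagonal)
    """
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    D = x.shape[0]
    # log b_i(x) of eqn. (2) for every component
    log_det = np.sum(np.log(variances), axis=1)
    mahalanobis = np.sum((x - means) ** 2 / variances, axis=1)
    log_b = -0.5 * (D * np.log(2.0 * np.pi) + log_det + mahalanobis)
    # log of the weighted sum in eqn. (1), computed with the log-sum-exp trick
    a = np.log(weights) + log_b
    a_max = np.max(a)
    return a_max + np.log(np.sum(np.exp(a - a_max)))

# One-dimensional, single-component check against the standard normal density at x = 0
print(gmm_log_likelihood([0.0], [1.0], [[0.0]], [[1.0]]))
```

For a whole utterance, the per-frame log-likelihoods are usually summed or averaged before models are
compared.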


4.2. Support Vector Machines (SVMs) 

An SVM is a binary classifier that makes its decisions by constructing a linear decision boundary, or
hyperplane, that optimally separates the two classes. The class of an unknown observation is predicted
from its position relative to the hyperplane. Consider training vectors and labels $(x_n, y_n)$, with
$x_n \in \mathbb{R}^d$, $y_n \in \{-1, +1\}$, $n \in \{1, \ldots, T\}$. The optimal hyperplane is chosen
according to the maximum margin criterion, and the goal of the SVM is to learn a function
$f: \mathbb{R}^d \to \mathbb{R}$ such that the class label of any unknown vector $x$ can be predicted as
$I(x) = \mathrm{sign}(f(x))$.

For linearly separable labeled data [5, 30], a hyperplane $H$ given by $w^T x + b = 0$ can be found
that separates the two classes of data, so that $y_n (w^T x_n + b) \geq 1$, $n = 1, \ldots, T$. An optimal
linear separator $H$ provides the maximum margin between the classes, i.e. the distance between $H$ and the
closest training points of the two classes is the largest possible. The maximum margin is found to be
$2 / \lVert w \rVert$, and the data points $x_n$ for which $y_n (w^T x_n + b) = 1$, i.e. those lying on the
margin, are known as support vectors. When the ASR training data are not linearly separable, the
speaker-specific features can be mapped by kernel functions to a higher-dimensional space in which they
are linearly separable.
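
The quantities in this section can be checked on a toy example. The Python sketch below does not train an
SVM; it assumes a hand-picked hyperplane (w, b) that happens to be the maximum-margin solution for four
hypothetical two-dimensional points, and shows the decision rule sign(f(x)), the margin 2/||w||, and the
support vectors as the points that satisfy the margin constraint with equality.

```python
import math

def decision(w, b, x):
    """f(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Class label I(x) = sign(f(x)), with f(x) = 0 assigned to +1."""
    return 1 if decision(w, b, x) >= 0 else -1

def margin(w):
    """Width of the margin between the two classes, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def support_vectors(w, b, data, tol=1e-9):
    """Training points lying exactly on the margin: y (w^T x + b) = 1."""
    return [x for x, y in data if abs(y * decision(w, b, x) - 1.0) <= tol]

# Hypothetical toy data: (vector, label)
data = [((2.0, 2.0), +1), ((3.0, 3.0), +1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]
w, b = (0.5, 0.5), -1.0   # assumed hyperplane, chosen by hand for this data

print(all(y * decision(w, b, x) >= 1.0 for x, y in data))  # constraints hold
print(margin(w))
print(support_vectors(w, b, data))
```

Here the margin equals the distance between the two closest points of opposite classes, which is why the
hand-picked hyperplane is optimal for this data.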


4.3. Factor Analysis (FA) of the GMM Supervectors 

The purpose of FA is to describe the variability in high-dimensional observable data vectors using a
smaller number of unobservable (hidden) variables. In ASR, FA was used in [31] to explain speaker- and
channel-dependent variability in the GMM supervector space. Many forms of FA have been employed since,
which ultimately led to the current state-of-the-art i-vector approach. In a linear distortion model, a
speaker-dependent GMM supervector $m_{s,h}$ is generally considered to be the sum of four linear
components:


$$m_{s,h} = m_0 + m_{spk} + m_{chn} + m_{res} \qquad (4)$$


where $m_0$ is the speaker-, channel-, and environment-independent component, $m_{spk}$ is the
speaker-dependent component, $m_{chn}$ is the channel/environment-dependent component, and $m_{res}$ is
the residual. The joint FA (JFA) model is formulated by combining the eigenvoice and eigenchannel models
with MAP adaptation in a single model. The speaker and channel subspaces are spanned by the matrices $V$
and $U$, respectively. The model assumes that, for a randomly chosen speaker $s$ and session $h$, the GMM
mean supervector can be represented by


$$m_{s,h} = m_0 + U x_h + V y_s + D z_{s,h} \qquad (5)$$


where $x_h$, $y_s$, and $z_{s,h}$ are the hidden channel, speaker, and residual factors. This is thus the
only model discussed so far that accounts for all four components of the linear distortion model in
eqn. (4). Indeed, JFA has been shown to outperform the other methods of its time.


4.4. i-Vector Approach 

In an effort to combine the strengths of the SVM and FA methods, Dehak et al. [32] attempted to use
JFA as a speaker-specific feature extractor for SVMs. In the initial effort, the speaker factors estimated
by JFA were used as speaker-specific features for the SVM classifiers. Observing that the channel factors
also contain speaker information, the speaker and channel factors were then combined into a single space,
called the total variability space [33]. In this FA model, a speaker- and session-dependent GMM supervector
is represented as


$$m_{s,h} = m_0 + T w_{s,h} \qquad (6)$$


The hidden variables $w_{s,h}$ are called total factors. As in all the FA methods described above, the
hidden variables are not observable, but their posterior expectation can be estimated. The estimates of
the total factors, which can be used as features in the next classifier stage, are known as i-vectors.


4.5. Linear Discriminant Analysis (LDA) approach 

LDA is a commonly employed technique in statistical pattern recognition that aims at finding linear
combinations of feature coefficients to facilitate discrimination of multiple classes. It finds orthogonal
directions in the feature space that are most effective for class discrimination. Projecting the original
features onto these directions improves the classification accuracy. Let $D$ denote the set of all
development utterances, $w_{s,i}$ the utterance feature obtained from the $i$th utterance of speaker $s$,
$n_s$ the total number of utterances belonging to $s$, and $S$ the total number of speakers in $D$. The
between-class and within-class covariance matrices $S_b$ and $S_w$ are given by


$$S_b = \frac{1}{S} \sum_{s=1}^{S} (\bar{w}_s - \bar{w})(\bar{w}_s - \bar{w})^T \qquad (7)$$

$$S_w = \frac{1}{S} \sum_{s=1}^{S} \frac{1}{n_s} \sum_{i=1}^{n_s} (w_{s,i} - \bar{w}_s)(w_{s,i} - \bar{w}_s)^T \qquad (8)$$


where the speaker-dependent mean vector is given by $\bar{w}_s = \frac{1}{n_s} \sum_{i=1}^{n_s} w_{s,i}$
and the speaker-independent mean vector by
$\bar{w} = \frac{1}{S} \sum_{s=1}^{S} \frac{1}{n_s} \sum_{i=1}^{n_s} w_{s,i}$. The LDA optimization is
therefore to maximize the between-class variance while minimizing the within-class variance. The exact
projections can be obtained from this optimization by solving the generalized eigenvalue problem


$$S_b v = \Lambda S_w v \qquad (9)$$


where $\Lambda$ is the diagonal matrix containing the eigenvalues. If the matrix $S_w$ in eqn. (8) is
invertible, the solution can easily be found from $S_w^{-1} S_b$. The LDA projection matrix $A_{LDA}$ of
dimension $R \times k$ is


$$A_{LDA} = [v_1 \; \cdots \; v_k] \qquad (10)$$


where the $k$ eigenvectors $v_1, \ldots, v_k$, corresponding to the $k$ largest eigenvalues, are obtained
by solving eqn. (9). The LDA transformation of the utterance feature $w$ is then obtained as


$$\Phi_{LDA}(w) = A_{LDA}^T w \qquad (11)$$
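
A minimal NumPy sketch of eqns. (7) to (11) follows. It assumes that $S_w$ is invertible and solves the
eigenvalue problem through $S_w^{-1} S_b$; the function names and the toy two-speaker data are
hypothetical, and real i-vectors would have several hundred dimensions.

```python
import numpy as np

def scatter_matrices(features, labels):
    """Between-class S_b (eqn. 7) and within-class S_w (eqn. 8) covariance matrices."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    speakers = np.unique(labels)
    means = np.array([features[labels == s].mean(axis=0) for s in speakers])
    global_mean = means.mean(axis=0)          # mean of the speaker means
    d = features.shape[1]
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for s, m in zip(speakers, means):
        diff = (m - global_mean)[:, None]
        S_b += diff @ diff.T
        centred = features[labels == s] - m
        S_w += centred.T @ centred / len(centred)
    return S_b / len(speakers), S_w / len(speakers)

def lda_matrix(S_b, S_w, k):
    """Columns are the k leading eigenvectors of S_w^{-1} S_b (eqns. 9 and 10)."""
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(-eigvals.real)[:k]
    return eigvecs[:, order].real

# Two hypothetical speakers separated along the first axis, noisy along the second
X = np.array([[-1.0, -1.0], [-1.0, 1.0], [-0.8, 0.0], [-1.2, 0.0],
              [1.0, -1.0], [1.0, 1.0], [1.2, 0.0], [0.8, 0.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

S_b, S_w = scatter_matrices(X, y)
A = lda_matrix(S_b, S_w, k=1)
projected = X @ A                              # eqn. (11) applied to every row
print(A.ravel())
```

The single retained direction is the first axis, along which the two speakers differ.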


4.6. Nuisance Attribute Projection (NAP) 

The application of the NAP algorithm in ASR is reported in [34]. In the NAP technique, the
speaker-specific feature space is transformed by an orthogonal projection onto the complementary space of
the channel, which depends only on the speaker. The projection is calculated from the within-class
covariance matrix: a $d \times d$ projection matrix of co-rank $k < d$ is defined as
$P = I - U_{[k]} U_{[k]}^T$, where $U_{[k]}$ is a low-rank rectangular matrix whose columns are the $k$
principal eigenvectors of the within-class covariance matrix $S_w$ in eqn. (8). NAP is performed on $w$ as
$\Phi_{NAP}(w) = P w$.
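
The NAP projection can be sketched with NumPy as follows. The function name and the toy covariance matrix
are hypothetical; the sketch only assumes that $S_w$ is symmetric, so that `numpy.linalg.eigh` returns
orthonormal eigenvectors.

```python
import numpy as np

def nap_projection(S_w, k):
    """P = I - U_k U_k^T, with U_k the k principal eigenvectors of S_w."""
    S_w = np.asarray(S_w, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(S_w)           # eigenvalues in ascending order
    U_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # keep the k largest
    return np.eye(S_w.shape[0]) - U_k @ U_k.T

# Hypothetical within-class covariance: most nuisance variation lies along the second axis
S_w = np.array([[0.02, 0.0], [0.0, 0.5]])
P = nap_projection(S_w, k=1)
w = np.array([1.0, 3.0])
print(P @ w)   # the component along the nuisance direction is removed
```

Because P is a projection, applying it twice gives the same result as applying it once.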


4.7. Within-Class Covariance Normalization (WCCN) 

The main goal of WCCN is to improve the robustness of the SVM-based ASR framework [35], which uses a
one-versus-all decision approach. The WCCN projection aims to reduce the false alarm and miss error rates
during SVM training. The within-class covariance matrix $S_w$ is calculated using eqn. (8), and the WCCN
projection is performed as $\Phi_{WCCN}(w) = A_{WCCN}^T w$. $A_{WCCN}$ is computed by Cholesky
factorization of $S_w^{-1}$, such that $S_w^{-1} = A_{WCCN} A_{WCCN}^T$. Unlike LDA and NAP, the WCCN
projection conserves the directions of the feature space.
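
A short NumPy sketch of the WCCN projection is given below. It assumes that $S_w$ is symmetric positive
definite so that its inverse has a Cholesky factor; the function name and the example matrix are
hypothetical.

```python
import numpy as np

def wccn_matrix(S_w):
    """Lower-triangular A_WCCN with A_WCCN A_WCCN^T = S_w^{-1} (Cholesky factorization)."""
    return np.linalg.cholesky(np.linalg.inv(np.asarray(S_w, dtype=float)))

# Hypothetical within-class covariance matrix
S_w = np.array([[2.0, 0.5], [0.5, 1.0]])
A_wccn = wccn_matrix(S_w)

w = np.array([1.0, -1.0])
projected = A_wccn.T @ w          # Phi_WCCN(w) = A_WCCN^T w

# After the projection the within-class covariance becomes the identity matrix
print(np.round(A_wccn.T @ S_w @ A_wccn, 6))
```

The printed matrix shows the normalizing effect: within-class variation is whitened rather than removed,
so no dimension of the feature space is discarded.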


5. ASR PERFORMANCE EVALUATION IN STANDARD SPEECH DATA SETS AND TYPES
OF ERROR

Performance evaluation of an ASR system is one of the main aspects of the research cycle. Performance
depends strongly on the variability of the voice signal and on noise and distortion in the communication
channel. Recognition has to cope with many problems: unrestricted input speech, non-cooperative speakers,
and uncontrolled environmental conditions. Two types of error may occur in the decision-making process of
an ASR system: (i) false rejection (in other words, non-detection), in which the system rejects a genuine
identity claim of the speaker under scrutiny, and (ii) false acceptance (in other words, false alarm), in
which the system accepts the identity claim of an impostor.

These errors are quantified as performance measures of a security system. They are (i) the False
Rejection Rate (FRR), which indicates the percentage of incorrectly rejected clients, and (ii) the False
Acceptance Rate (FAR), which indicates the percentage of incorrectly accepted impostors. In a real-life
biometric security system, which is usually imperfect, the characteristic curves of FRR and FAR intersect
at a certain point called the 'Equal Error Rate (EER)'. If one fixes a very low threshold value, the
system exhibits a very low FRR and a very high FAR and accepts all identity claims. Conversely, if one
fixes a very high threshold value, the system exhibits a very high FRR and a very low FAR and rejects all
identity claims. In this context, one can plot a curve called the 'Receiver Operating Characteristic
(ROC)', which involves FRR and FAR. The ROC curve is a graphical indication of the system performance.
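
The relationship between the threshold, FAR, FRR, and EER can be sketched in plain Python. The example
assumes that a higher score means a stronger match, that a claim is accepted when its score reaches the
threshold, and that the EER is approximated at the candidate threshold where FAR and FRR are closest; the
score lists are hypothetical.

```python
def far_frr(genuine, impostor, threshold):
    """FAR and FRR when a claim is accepted if its score is >= threshold."""
    frr = sum(s < threshold for s in genuine) / len(genuine)
    far = sum(s >= threshold for s in impostor) / len(impostor)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Approximate EER: scan the observed scores as candidate thresholds."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]

# Hypothetical verification scores
genuine_scores = [0.9, 0.8, 0.7, 0.4]    # true speaker trials
impostor_scores = [0.6, 0.3, 0.2, 0.1]   # impostor trials

print(far_frr(genuine_scores, impostor_scores, 0.0))   # very low threshold: accept everything
print(far_frr(genuine_scores, impostor_scores, 1.0))   # very high threshold: reject everything
print(equal_error_rate(genuine_scores, impostor_scores))
```

The first two calls reproduce the two extreme cases described in the text; the last one returns the
approximate EER together with the threshold at which it occurs.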

The EER does not distinguish between the two types of errors, which sometimes makes it an unrealistic
measure of ASR performance. The detection cost function (DCF) therefore introduces numerical penalty costs
for the two types of errors. The a priori probability of encountering a target speaker is also specified,
and the DCF is computed as a function of the decision threshold $\tau$ as
$DCF(\tau) = C_{Miss} P_{Miss}(\tau) P_{Target} + C_{FA} P_{FA}(\tau) (1 - P_{Target})$, where $C_{Miss}$
is the cost of a miss (FR) error, $C_{FA}$ is the cost of an FA error, $P_{Target}$ is the prior
probability of a target speaker, $P_{Miss}(\tau)$ is the probability of (Miss | Target, Threshold =
$\tau$), and $P_{FA}(\tau)$ is the probability of (FA | Nontarget, Threshold = $\tau$).

In NIST SRE 2008, the three quantities $C_{Miss} = 10$, $C_{FA} = 1$, and $P_{Target} = 0.01$ are
predefined. In general, the goal of the ASR system designer is to find the optimum threshold value that
minimizes the DCF. The prior value $P_{Target} = 0.01$ indicates that the ASR system will encounter a
target speaker once in every 100 speaker verification attempts. When speaker recognition performance is
evaluated at different operating points, the detection error trade-off (DET) curve is usually used. The
DET curve is a plot of the FAR against the FRR (miss rate). As the performance of the ASR system improves,
the curve moves toward the origin. The DET curve nearest to the origin represents the better ASR system.
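
The DCF and the search for its minimizing threshold can be sketched in plain Python. The default costs
and prior are the NIST SRE 2008 values quoted above; the score lists, the function names, and the
accept-if-score-reaches-threshold convention are assumptions made for the illustration.

```python
def dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Detection cost: C_Miss * P_Miss * P_Target + C_FA * P_FA * (1 - P_Target)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def min_dcf(genuine, impostor, **costs):
    """Scan candidate thresholds and return (minimum cost, threshold)."""
    best = None
    # The observed scores plus +inf (reject everything) cover every distinct decision
    candidates = sorted(set(genuine) | set(impostor)) + [float("inf")]
    for t in candidates:
        p_miss = sum(s < t for s in genuine) / len(genuine)
        p_fa = sum(s >= t for s in impostor) / len(impostor)
        cost = dcf(p_miss, p_fa, **costs)
        if best is None or cost < best[0]:
            best = (cost, t)
    return best

# Hypothetical verification scores (higher means a stronger match)
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]

print(dcf(0.25, 0.25))                           # cost at the equal-error operating point
print(min_dcf(genuine_scores, impostor_scores))  # lowest cost and its threshold
```

With a target prior of 0.01, false alarms dominate the cost, so the minimizing threshold sits above the
equal-error threshold and trades a higher miss rate for fewer false acceptances.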


6. CONCLUSION 

Much work remains before we fully understand how the human brain makes decisions about speech content
and speaker identity. However, from what we know, it can be said that ASR systems should focus more on
high-level speaker-specific features in order to improve performance. Human beings are effective at
identifying the unique traits of speakers they know very well, while ASR systems can learn a specific
trait only if a measurable feature parameter can be defined correctly. Automatic systems are better at
searching large collections of audio and, possibly, more effective at narrowing down the audio samples
that are likely to be speaker matches, while humans are better at comparing a smaller subset and overcome
microphone or channel mismatch more easily. It may be useful to examine what exactly it means to "know" a
speaker from the perspective of a practical system. The search for alternative compact speaker
representations and audio segments that emphasize the relevant identity parameters, while eliminating
nuisance components, will always remain a challenge for developers of state-of-the-art ASR systems.